---
data_dictionary: sharing.xlsx
2024-08-27
A quality control program is easiest
to implement from the top down.
Make sure that you understand the
commitment of time and money
that is involved. Every workplace is
different, but think about allocating
10% of your time and 10% of the
time of all your employees to
quality control.
---
data_dictionary: sharing.xlsx
source: |
  Saginova, Olga (2020), “Dataset on the
  questionnaire-based survey of sharing
  services users’ motivation”, Mendeley Data,
  V1, doi: 10.17632/c5k8wjrhd9.1
description: |
  From the original source: "The data set
  presents data collected by online survey
  with a questionnaire using Likert scale.
  The survey sample included 184 adults (18+),
  active and potential users of different
  sharing services platforms."
copyright: |
  CC BY 4.0. You can share, copy and modify
  this dataset so long as you give appropriate
  credit, provide a link to the CC BY license,
  and indicate if changes were made, but you
  may not do so in a way that suggests the
  rights holder has endorsed you or your use
  of the dataset. Note that further permission
  may be required for any content within the
  dataset that is identified as belonging to a
  third party.
format:
  proprietary (Excel)
varnames:
  first row of data
missing_value_code:
  not applicable
size:
  rows: 184
  columns: 31
age:
  label: How old are you?
  values:
  - "18-25"
  - "26-35"
  - "36-45"
  - "46-60"
  - over 60
gender:
  values:
  - F
  - M
employment_status:
  label: Are you employed?
  values:
  - employed
  - entrepreneur
  - full-time student
  - self-employed
  - temporarily unemployed
  - unemployed
---
title: "Counts and percentages"
format:
  html:
    slide-number: true
    embed-resources: true
editor: source
execute:
  echo: true
  message: false
  warning: false
---
## Data source
This program uses data from a study of sharing services (like sharing an automobile) and produces counts and percentages for a few demographic variables. There is a [data dictionary][dd] that provides more details about the data.
[dd]: https://github.com/pmean/datasets/blob/master/sharing.yaml
## Libraries
Here are the libraries you need for this program.
```{r setup}
library(readxl)
library(tidyverse)
```
## Reading the data
Here is the code to read the data and show a glimpse. There are 31 columns total, but I am showing just a few of the columns here.
```{r read}
fn <- "../data/sharing.xlsx"
sharing <- read_excel(fn)
glimpse(sharing[ , c(1, 5:7)])
```
## Calculate counts and percentages for age group
```{r count-age-groups}
sharing |>
  group_by(age) |>
  summarize(n=n()) |>
  mutate(total=sum(n)) |>
  mutate(pct=100*n/total)
```
The survey respondents were younger than the general population. About half of the survey respondents were 18 to 25 years old. Only 3% were over 60. Six ages were missing.
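
The same kind of table can be written a little more compactly with `count()`, and missing ages raise the question of which denominator to use. The sketch below is only an illustration: the `toy` data frame and its age values are invented and are not the survey responses.

```{r sketch-count}
library(dplyr)

# Hypothetical stand-in for the age column; these values are invented
# for illustration and are not the survey responses.
toy <- tibble(
  age = c("18-25", "18-25", "26-35", "36-45", NA, "18-25", "over 60", "26-35")
)

# count() is shorthand for group_by() followed by summarize(n=n()).
# Here the missing age forms its own row and is part of the denominator.
toy_counts <- toy |>
  count(age) |>
  mutate(pct = 100 * n / sum(n))
toy_counts

# Percentages among only those respondents who reported an age.
toy_valid <- toy |>
  filter(!is.na(age)) |>
  count(age) |>
  mutate(pct = 100 * n / sum(n))
toy_valid
```

Either denominator can be defensible, but the text should say which one was used.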
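
| row | room | before | after |
|---|---|---|---|
| 1 | 121 | 11.8 | 10.1 |
| 2 | 163 | 8.2 | 7.2 |
| 3 | 125 | 7.1 | 3.8 |
| 4 | 264 | 14.0 | 12.0 |
| 5 | 233 | 10.8 | 8.3 |
| 6 | 218 | 10.1 | 10.5 |
| 7 | 324 | 14.6 | 12.1 |
| 8 | 325 | 14.0 | 13.7 |
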
\(\frac{1}{8}(11.8+8.2+7.1+14+10.8+10.1+14.6+14)\)
\(= \frac{1}{8}(90.6)\)
\(= 11.325\)
The average colony count per cubic foot before remediation, 11.3, is quite large.
\(\frac{1}{8}(10.1+7.2+3.8+12+8.3+10.5+12.1+13.7)\)
\(= \frac{1}{8}(77.7)\)
\(= 9.7125\)
The average colony count per cubic foot after remediation, 9.7, is smaller, but still quite large.
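
The hand calculations above can be checked in R. This is a minimal sketch that types the eight before and after values directly into vectors rather than reading them from a file; the vector names are chosen here for illustration.

```{r sketch-means}
# Colony counts typed in from the table above (rooms in the original order).
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)
after  <- c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)

# The mean is the sum divided by the number of observations.
sum(before) / length(before)
sum(after) / length(after)

# mean() does the same calculation in one step.
mean(before)
mean(after)
```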
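Here is the sorted data.

| row | room | before |
|---|---|---|
| 1 | 125 | 7.1 |
| 2 | 163 | 8.2 |
| 3 | 218 | 10.1 |
| 4 | 233 | 10.8 |
| 5 | 121 | 11.8 |
| 6 | 264 | 14.0 |
| 7 | 325 | 14.0 |
| 8 | 324 | 14.6 |

Here are the middle two observations.

| row | room | before | middle |
|---|---|---|---|
| 1 | 125 | 7.1 | |
| 2 | 163 | 8.2 | |
| 3 | 218 | 10.1 | |
| 4 | 233 | 10.8 | 10.8 |
| 5 | 121 | 11.8 | 11.8 |
| 6 | 264 | 14.0 | |
| 7 | 325 | 14.0 | |
| 8 | 324 | 14.6 | |

Average the two middle observations.

| row | room | before | middle | median |
|---|---|---|---|---|
| 1 | 125 | 7.1 | | |
| 2 | 163 | 8.2 | | |
| 3 | 218 | 10.1 | | |
| 4 | 233 | 10.8 | 10.8 | 11.3 |
| 5 | 121 | 11.8 | 11.8 | |
| 6 | 264 | 14.0 | | |
| 7 | 325 | 14.0 | | |
| 8 | 324 | 14.6 | | |
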
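Here is the sorted data.

| row | room | after |
|---|---|---|
| 1 | 125 | 3.8 |
| 2 | 163 | 7.2 |
| 3 | 233 | 8.3 |
| 4 | 121 | 10.1 |
| 5 | 218 | 10.5 |
| 6 | 264 | 12.0 |
| 7 | 324 | 12.1 |
| 8 | 325 | 13.7 |

Here are the middle two observations.

| row | room | after | middle |
|---|---|---|---|
| 1 | 125 | 3.8 | |
| 2 | 163 | 7.2 | |
| 3 | 233 | 8.3 | |
| 4 | 121 | 10.1 | 10.1 |
| 5 | 218 | 10.5 | 10.5 |
| 6 | 264 | 12.0 | |
| 7 | 324 | 12.1 | |
| 8 | 325 | 13.7 | |

Average the two middle observations.

| row | room | after | middle | median |
|---|---|---|---|---|
| 1 | 125 | 3.8 | | |
| 2 | 163 | 7.2 | | |
| 3 | 233 | 8.3 | | |
| 4 | 121 | 10.1 | 10.1 | 10.3 |
| 5 | 218 | 10.5 | 10.5 | |
| 6 | 264 | 12.0 | | |
| 7 | 324 | 12.1 | | |
| 8 | 325 | 13.7 | | |
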
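
The three steps (sort, pick the two middle values, average them) can be sketched in R as follows. The helper `median_by_hand()` is a name made up for this illustration, and it assumes an even number of observations, as in this dataset.

```{r sketch-median}
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)
after  <- c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)

# Hypothetical helper: median for an even number of observations.
median_by_hand <- function(x) {
  sorted <- sort(x)                        # step 1: sort
  n <- length(sorted)
  middle <- sorted[c(n / 2, n / 2 + 1)]    # step 2: the two middle values
  mean(middle)                             # step 3: average them
}

median_by_hand(before)
median_by_hand(after)

# The built-in function should agree with the hand calculation.
median(before)
median(after)
```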
Excerpt from Gould 1985 publication
Chen et al 2019
Background: The prices of newly approved cancer drugs have risen over the past decades. A key policy question is whether the clinical gains offered by these drugs in treating specific cancer indications justify the price increases.
Results: We found that between 1995 and 2012, price increases outstripped median survival gains, a finding consistent with previous literature. Nevertheless, price per mean life-year gained increased at a considerably slower rate, suggesting that new drugs have been more effective in achieving longer-term survival. Between 2013 and 2017, price increases reflected equally large gains in median and mean survival, resulting in a flat profile for benefit-adjusted launch prices in recent years.
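Here is the sorted data.

| row | room | before |
|---|---|---|
| 1 | 125 | 7.1 |
| 2 | 163 | 8.2 |
| 3 | 218 | 10.1 |
| 4 | 233 | 10.8 |
| 5 | 121 | 11.8 |
| 6 | 264 | 14.0 |
| 7 | 325 | 14.0 |
| 8 | 324 | 14.6 |

Calculate 0.75*(8+1) = 6.75. Select the 6th and 7th observations.

| row | room | before | pick |
|---|---|---|---|
| 1 | 125 | 7.1 | |
| 2 | 163 | 8.2 | |
| 3 | 218 | 10.1 | |
| 4 | 233 | 10.8 | |
| 5 | 121 | 11.8 | |
| 6 | 264 | 14.0 | 14 |
| 7 | 325 | 14.0 | 14 |
| 8 | 324 | 14.6 | |

Average the two observations.

| row | room | before | pick | q3 |
|---|---|---|---|---|
| 1 | 125 | 7.1 | | |
| 2 | 163 | 8.2 | | |
| 3 | 218 | 10.1 | | |
| 4 | 233 | 10.8 | | |
| 5 | 121 | 11.8 | | |
| 6 | 264 | 14.0 | 14 | 14 |
| 7 | 325 | 14.0 | 14 | |
| 8 | 324 | 14.6 | | |
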
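Here is the sorted data.

| row | room | after |
|---|---|---|
| 1 | 125 | 3.8 |
| 2 | 163 | 7.2 |
| 3 | 233 | 8.3 |
| 4 | 121 | 10.1 |
| 5 | 218 | 10.5 |
| 6 | 264 | 12.0 |
| 7 | 324 | 12.1 |
| 8 | 325 | 13.7 |

Calculate 0.75*(8+1) = 6.75. Select the 6th and 7th observations.

| row | room | after | pick |
|---|---|---|---|
| 1 | 125 | 3.8 | |
| 2 | 163 | 7.2 | |
| 3 | 233 | 8.3 | |
| 4 | 121 | 10.1 | |
| 5 | 218 | 10.5 | |
| 6 | 264 | 12.0 | 12 |
| 7 | 324 | 12.1 | 12.1 |
| 8 | 325 | 13.7 | |

Average the two observations.

| row | room | after | pick | q3 |
|---|---|---|---|---|
| 1 | 125 | 3.8 | | |
| 2 | 163 | 7.2 | | |
| 3 | 233 | 8.3 | | |
| 4 | 121 | 10.1 | | |
| 5 | 218 | 10.5 | | |
| 6 | 264 | 12.0 | 12 | 12.05 |
| 7 | 324 | 12.1 | 12.1 | |
| 8 | 325 | 13.7 | | |
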
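
The hand rule for the third quartile can be sketched in R as follows. The helper `q3_by_hand()` is a name invented for this illustration. R's `quantile()` function interpolates between the two observations by default instead of averaging them, so its answer can differ slightly from the hand calculation.

```{r sketch-q3}
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)
after  <- c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)

# Hypothetical helper for the rule shown above: compute 0.75*(n+1),
# then average the observations on either side of that position.
q3_by_hand <- function(x) {
  sorted <- sort(x)
  pos <- 0.75 * (length(sorted) + 1)
  mean(sorted[c(floor(pos), ceiling(pos))])
}

q3_by_hand(before)
q3_by_hand(after)

# Default quantile() for comparison; it uses linear interpolation.
quantile(before, probs = 0.75)
quantile(after, probs = 0.75)
```

For the after values, the default `quantile()` result is 12.025 rather than 12.05. Neither answer is wrong; there are several accepted conventions for percentiles in small samples.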
\[S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2}\]
There is at least one alternative formula for the standard deviation.
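
As a rough check, the formula above can be computed step by step in R and compared with `sd()`. The sketch also tries one commonly taught alternative, the shortcut formula \(S = \sqrt{\frac{1}{n-1}\left(\sum X_i^2 - n\bar{X}^2\right)}\). Whether this is the alternative intended here is an assumption.

```{r sketch-sd}
before <- c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0)
n <- length(before)

# Definition: squared deviations from the mean, divided by n-1.
s_definition <- sqrt(sum((before - mean(before))^2) / (n - 1))

# Shortcut formula (assumed alternative): sum of squares minus n times
# the squared mean, divided by n-1.
s_shortcut <- sqrt((sum(before^2) - n * mean(before)^2) / (n - 1))

s_definition
s_shortcut
sd(before)
```

All three values should agree up to rounding error.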
---
data_dictionary: "legionnaire's disease"
format:
  txt: tab-delimited
varnames:
  first row of data
missing_value_code:
  not needed
description: >
  Fictional data on bacteria counts before
  and after air conditioning maintenance.
additional_description:
  https://dasl.datadescription.com/datafile/legionnaires-disease
download_url:
  https://dasl.datadescription.com/download/data/3310
notes: >
  The use of a space in the first variable name might
  cause some minor difficulties during import.
source: >
  DASL (Data and Story Library), a repository for various
  data sets useful for teaching.
copyright: >
  Unknown. You should be able to use this data for
  individual educational purposes under the Fair Use
  guidelines of U.S. copyright law.
size:
  rows: 8
  columns: 3
vars:
  Room number:
    label: Hotel room number
  Before:
    label: Bacterial count before maintenance
    unit: colonies per cubic foot
  After:
    label: Bacterial count after maintenance
    unit: colonies per cubic foot
---
---
title: "Univariate statistics for Legionnaires' disease"
format:
  html:
    slide-number: true
    embed-resources: true
editor: source
execute:
  echo: true
  message: false
  warning: false
---
## Data source
This program uses data from a fictional study of Legionnaires' disease and produces some simple univariate statistics: means, standard deviations, and percentiles. There is a [data dictionary][dd] that provides more details about the data.
[dd]: https://github.com/pmean/data/blob/main/files/legionnaires-disease.yaml
## Libraries
Here are the libraries you need for this program.
```{r setup}
library(tidyverse)
```
## Reading the data
Here is the code to read the data and show a glimpse. There are only three columns (the room number and the counts before and after), so the glimpse shows all of them.
```{r read}
fn <- "../data/legionnaires-disease.txt"
ld_raw_data <- read_tsv(fn, col_types="cnn")
glimpse(ld_raw_data)
```
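
The `col_types="cnn"` argument asks for one character column followed by two numeric columns. The sketch below shows that behavior without needing the file: it types the eight rows in as literal tab-delimited text, which is an assumption about how the file looks (including the header `Room Number`), and it relies on `I()` to mark a string as literal data, which needs readr 2.0 or later.

```{r sketch-read-check}
library(readr)

# The eight rows typed in as literal tab-delimited text. The header
# names are assumed from the description of the file.
ld_text <- paste(
  "Room Number\tBefore\tAfter",
  "121\t11.8\t10.1",
  "163\t8.2\t7.2",
  "125\t7.1\t3.8",
  "264\t14.0\t12.0",
  "233\t10.8\t8.3",
  "218\t10.1\t10.5",
  "324\t14.6\t12.1",
  "325\t14.0\t13.7",
  sep = "\n"
)

# I() tells readr that the string is data, not a file name.
ld_check <- read_tsv(I(ld_text), col_types = "cnn")

# Room numbers stay character; the two counts are numeric.
sapply(ld_check, class)
dim(ld_check)
```

Reading the room number as character is a reasonable choice because it is a label, not a quantity that you would average.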
## Rename, 1
Notice how R encloses the first variable name (Room Number) in backticks. This is needed when a variable name includes an embedded blank. You should rename this variable at your first opportunity.
```{r rename-1}
names(ld_raw_data)[1] <- "Room_Number"
glimpse(ld_raw_data)
```
## Rename, 2
I find that many of the mistakes that I make are due to inconsistencies in how I name variables. Capitalization is one of the biggest problems. So I have gotten into the habit of converting variable names to all lower case. That way I don't have to worry about whether it is "Before" or "before". Here is the code to convert every capital letter to a lowercase letter.
```{r rename-2}
names(ld_raw_data) <- tolower(names(ld_raw_data))
glimpse(ld_raw_data)
```
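
If you prefer to stay inside a pipeline, both renaming steps can be done at once with `rename_with()` from dplyr. This is a sketch of one possible alternative, not a replacement for the code above. The small `toy` tibble holds just the first two rows, and its column names are assumed to match the file.

```{r sketch-rename}
library(dplyr)

# First two rows only, with column names assumed to match the file.
toy <- tibble(
  `Room Number` = c("121", "163"),
  Before = c(11.8, 8.2),
  After = c(10.1, 7.2)
)

# Replace blanks with underscores, then convert to lower case.
toy_clean <- toy |>
  rename_with(\(x) tolower(gsub(" ", "_", x)))

names(toy_clean)
```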
## Calculate means and standard deviations before remediation
```{r before-means}
ld_raw_data |>
  summarize(
    before_mn=mean(before),
    before_sd=sd(before))
```
The average colony count per cubic foot before remediation, 11.3, is quite large. The standard deviation, 2.8, represents a moderate amount of variation in this variable.
## Calculate means and standard deviations after remediation
```{r after-means}
ld_raw_data |>
  summarize(
    after_mn=mean(after),
    after_sd=sd(after))
```
The average colony count per cubic foot after remediation, 9.7, is still quite large. The standard deviation, 3.2, represents a moderate amount of variation in this variable and is roughly comparable to the variation before remediation.
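
When the same statistics are needed for several variables, `across()` can compute them in one step. The sketch below is one possible shortcut, with the eight before and after values typed in directly so that it runs on its own; the name `ld` is chosen here for illustration.

```{r sketch-across}
library(dplyr)

ld <- tibble(
  before = c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0),
  after  = c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)
)

# The list names (mn, sd) become suffixes: before_mn, before_sd, and so on.
ld_summary <- ld |>
  summarize(across(c(before, after), list(mn = mean, sd = sd)))

ld_summary
```

Having the before and after results side by side makes the comparison of the two standard deviations easier to see.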
## Calculate median and range before intervention
You could also use `median(before)`, `min(before)`, and `max(before)` in the code below.
```{r before-quantiles}
ld_raw_data |>
  summarize(
    before_median=quantile(before, probs=0.5),
    before_min=quantile(before, probs=0),
    before_max=quantile(before, probs=1))
```
The median colony count before remediation, 11.3, is roughly the same as the mean. The data range from 7.1 to 14.6 colonies per cubic foot, a fairly wide range.
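
Here is a sketch of the alternative mentioned earlier, using `median()`, `min()`, and `max()` in place of `quantile()`. The before values are typed in directly so that the example runs on its own.

```{r sketch-before-range}
library(dplyr)

ld <- tibble(before = c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0))

# Same three statistics, computed with the dedicated functions.
ld_range <- ld |>
  summarize(
    before_median = median(before),
    before_min = min(before),
    before_max = max(before))

ld_range
```

The dedicated functions return plain numbers, while `quantile()` attaches a name such as `50%` to each result.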
## Calculate median and range after intervention
```{r after-quantiles}
ld_raw_data |>
  summarize(
    after_q50=quantile(after, probs=0.5),
    after_min=quantile(after, probs=0),
    after_max=quantile(after, probs=1))
```
The median colony count, 10.3, is slightly lower after remediation. The data range from 3.8 to 13.7 colonies per cubic foot, a span of 9.9, which is somewhat wider than the span of 7.5 before remediation.
## Additional comments
The names that you choose for the left-hand side of the equal sign are arbitrary. You should choose a descriptive name, but you have lots of options. A median of the before and after values could be called
- Before_median, After_median
- Median0, Median1
- Second_quartile_A, Second_quartile_B
- or many other reasonable choices.
## Calculate a change score
For data like this with two measurements before and after an intervention, you should compute a change score. The way the computations are done below, a positive value means a reduction in colony counts. Note that any time you make a major change in a dataset, you should save it with a different name. That makes it easier for you to back up if you end up going down a blind alley.
```{r}
ld_raw_data |>
  mutate(change=before-after) -> ld_change_scores
glimpse(ld_change_scores)
```
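
Once the change score exists, it can be summarized like any other variable. The sketch below rebuilds the data from the eight rooms so that it runs on its own; the names `ld_change` and `change_summary` are chosen here for illustration.

```{r sketch-change-summary}
library(dplyr)

ld_change <- tibble(
  room_number = c("121", "163", "125", "264", "233", "218", "324", "325"),
  before = c(11.8, 8.2, 7.1, 14.0, 10.8, 10.1, 14.6, 14.0),
  after  = c(10.1, 7.2, 3.8, 12.0, 8.3, 10.5, 12.1, 13.7)
) |>
  mutate(change = before - after)

# A positive change is a reduction; a negative change is an increase.
change_summary <- ld_change |>
  summarize(
    change_mn = mean(change),
    change_sd = sd(change),
    n_increase = sum(change < 0))

change_summary
```

The mean change equals the difference between the two means (11.325 minus 9.7125), and one room shows an increase rather than a reduction.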